Abstract

Process mining represents an important field in BPM and data
mining research. Recently, it has gained importance also for
practitioners: more and more companies are creating business process
intelligence solutions. The evaluation of process mining algorithms
requires, as any other data mining task, the availability of large
amounts of real-world data. Despite the increasing availability of such
datasets, they are affected by many limitations, in primis the absence
of a "gold standard" (i.e., the reference model).

This paper extends an approach, already available in the literature,
for the generation of random processes. Novelties have been introduced
throughout the work and, in particular, they involve the complete
support for multiperspective models and logs (i.e., the control-flow
perspective is enriched with time and data information) and for
online settings (i.e., generation of multiperspective event streams and
concept drifts). The proposed new framework is able to almost entirely
cover the spectrum of possible scenarios that can be observed in
the real world. The proposed approach is implemented as a publicly
available Java application, with a set of APIs for the programmatic
execution of experiments.
1 Introduction


Process mining [34] has gained a lot of attention and is now considered
an important field of research, bridging data mining and business process
modeling/analysis. In particular, the aim of process mining is to extract
useful information from business process executions. Under the umbrella
of process mining, different activities can be identified. For example,
control-flow discovery aims at reconstructing the actual process model
starting only from the observations of its executions; conformance
checking tries to discover discrepancies between the expected (i.e.,
compliant) executions and the actual ones; enhancement extends a process
model with additional information obtained from the actual observations.

In data mining, the term gold standard (sometimes also referred to as
ground truth) typically indicates the "correct" answer to a mining task
(i.e., the reference model). For example, in data clustering, the gold
standard may represent the right (i.e., the target) assignment of elements
to their corresponding clusters. Many times, referring to a gold standard
is fundamental in order to properly evaluate the quality of mining
algorithms. Several concepts, like precision and recall, are actually
grounded on this idea. In general, it is possible to identify the concept
of gold standard for all mining tasks.

As for all other mining challenges, the evaluation of new algorithms
is difficult. In order to properly assess the quality of mining algorithms,
the evaluation should typically be based on real-world data. However, in
the context of business processes, companies are usually reluctant to
publicly share their data for analysis purposes. Moreover, detailed
information on their running processes (i.e., the reference models)
is considered a company asset and is therefore kept private.

For a few years, an annual event called the BPI challenge^1 has released
real-world event logs. Despite the importance of this data, the logs are
not accompanied by their corresponding gold standards. Moreover, they do
not provide examples of all possible real-world situations: many times,
researchers and practitioners would like to test their algorithms and
systems against specific conditions and, for this purpose, those event
logs may not be enough. Some other tools, described in the literature
(Section 2), can be used to construct business processes or to simulate
existing ones. However, they are very difficult to use and limited in
several aspects (e.g., they can only generate process models, or can
simulate just the control-flow perspective).

1.1 Research Challenges 

The final aim of this paper is to support researchers and practitioners in
developing new algorithms and techniques for process mining and business
process intelligence. Moreover, we put particular emphasis on the
online/stream paradigm which, with the advent of big data and the Internet
of Things, is attracting growing interest. To achieve our goal we have to
face the general data availability problem which, in our context, can be
decomposed into several research challenges:

C1 build large repositories of randomly created process models with
control-flow and data perspectives;

C2 obtain realistic (e.g., noisy) multiperspective event logs, which
refer to a model already known (i.e., the gold standard), to test
process mining algorithms;

C3 generate potentially infinite multiperspective streams of events
starting from process models. These streams have to simulate realistic
scenarios, e.g., they could contain noise, fluctuating event emission
rates, and concept drifts.

^1 See http://www.win.tue.nl/bpi/2015/challenge


C1 and C2 are required in order to test an approach against several
different datasets, and avoid overfitting phenomena (i.e., tailoring an
approach to perform well on particular data, but lacking in abstraction).

C3 is becoming more and more important due to the emerging importance
of big data analysis. Big data is typically characterized [16] by the
data volume and velocity (a typical way of dealing with such volume and
velocity is via unbounded event streams); variety (for this, we need
multiperspective models, not only with the control-flow perspective);
and variability (this led us to properly simulate concept drifts).


In this paper we propose a series of algorithms which can be used
to randomly generate multiperspective process models (Section 4). These
models can easily be simulated in order to generate multiperspective event
logs (Section 5). Moreover, the whole approach is designed keeping the
simulation of online settings in mind: it is possible to generate drifts
on the processes (i.e., local evolutions) and it is possible to simulate
multiperspective event streams (which also replicate the drifts).

Therefore, the aim of this paper is twofold: on one hand, we aim at
describing the extensions made with respect to our previous work, which
constitute PLG2. On the other hand, we want to highlight the research
challenges that need to be solved in order to create realistic and useful
test data.

The new approach is implemented in a standalone Java application
which is also accompanied by a set of APIs, useful for the programmatic
definition of custom experiments.

In summary, this paper extends our previous work since we are now
able to:


• generate random process models with an additional data perspective (or
import existing ones);

• have detailed control over the data attributes (e.g., by controlling
their values via scripts);

• have detailed control over the time perspective (controlled via scripts);

• evolve a process model, by randomly changing some of its features
(e.g., adding/removing/replacing subprocesses);

• generate a realistic multiperspective event log, with executions of a
process model and noise addition (with probabilities for different
noise behaviors);

• generate a stream of multiperspective events referring to process
models that could change over time, with a customizable output rate.


2 Related Work 


The idea of generating process models for evaluating process mining
algorithms has already been explored.

In particular, van Hee and Liu presented an approach to generate random
Petri nets representing processes. Specifically, they suggest using a
top-down approach, based on a step-wise refinement of Workflow nets [36],
to generate all possible process models belonging to a particular class
of Workflow nets (also called Jackson nets). This approach has been
adopted, for example, in the generation of collections of process models
with specific features [37]. A similar and related approach has also been
reported, where the authors propose to generate Petri nets according to a
different set of refinement rules.

In both cases, the approaches do not address the problem of generating
traces from the developed Petri nets. This task, however, has been
explored in the past, in particular for the generation of process mining
oriented logs [12]. The idea is to decorate a Petri net model, using
CPN Tools^2, in order to log executions into event log files. Although the
approach is extremely flexible and grounded on a solid tool, it suffers
from usability drawbacks. The most important problems are the complexity
of the whole procedure, which is also particularly error prone; the
complexity in managing timestamps; the impossibility of simulating data
objects (i.e., multiperspective models) in a proper way; and the
impossibility of simulating streams.

Our previous work, which is extended by this paper, provides a first
complete tool for the random generation of process models and their
execution logs.

The approach described in this paper (namely, PLG2) extends previous
works in two substantial ways. On one hand, it improves the generation of
random models and their logs by adding data and time perspectives: PLG2
is capable of generating random data objects and simulating manually
defined ones. Moreover, complete and detailed support for activity and
trace timing is provided as well. On the other hand, the whole project
has been designed with online settings in mind: it is possible to easily
(and automatically) generate random new versions of process models, in
order to simulate "concept drifts". Moreover, processes can be simulated
to generate multiperspective (i.e., with data and preserving temporal
relations) event streams.


^2 See http://cpntools.org


Figure 1: Class diagram, represented in UML, of the internal structure
used for the representation of a process model in PLG2. The structure
basically reflects a possible instance of a BPMN model diagram.

3 Process Models in PLG2 

This section presents the internal representation used to handle business
processes. The generation of random business processes is reported as well.

3.1 Internal Representation of Business Processes 

In PLG2, the internal structure of a process model is rather intuitively
derived from the definition of a BPMN process model. Figure 1 depicts the
diagram of the classes involved in the modeling. In particular, a process
is essentially an aggregation of components. Each component can be either
a flow object, a sequence or a data object. Flow objects are divided
into:


• events (which are divided into start or end); 

• gateways (either exclusive or parallel); 

• tasks (which can be activities). 

Please note that it is not possible to instantiate general events,
gateways or tasks (i.e., those classes are abstract). This technique is
used to enforce a proper typing of such (otherwise ambiguous) elements.
Sequences are used to connect two flow objects. A sequence, clearly,
imposes a direction on the flow. Data objects are associated with
activities and can be plain data objects or dynamic data objects. A plain
data object is basically a key-value pair. A dynamic data object, instead,
has a value that can change every time it is required (i.e., it is
dynamically generated by a script). Another characterization of data
objects is with respect to their "direction": a data object can be
generated or required by an activity. These two characterizations play
an important role in the process simulation phase. We will get into these
details in Section 5.

With respect to the previous version of this work, we decided to
evolve the internal structure into a more general one. This BPMN
standard-oriented representation is fundamental in order to allow much
more flexibility. For example, it is now possible to load BPMN models
generated with external tools^3, as long as the modeled components are
also available in PLG2 (for example, it is not possible to load a BPMN
model with inclusive gateways). At the same time, since we are restricting
ourselves to non-ambiguous components, we can also convert our processes
into more formal languages (e.g., it is possible to convert and export the
generated models into Petri nets, using the PNML file format^4).

3.2 Formal Definition of Business Process 

The process representation that we just reported can be structured in a
more formal definition. Specifically, a process model $P$ can be seen as a
graph $P = (V, E)$, where $V$ is the set of nodes, and
$E \subseteq V \times V$ is the set of directed edges. However, since in
our context not all nodes or edges are equal, we can refine the definition
of $V$ and $E$. Let's then define a process as a tuple
$P = ((E_{start}, E_{end}, A, G, D), (S, C))$, where:

• $(E_{start}, E_{end}, A, G, D)$ is a tuple, in which each component is a
set of nodes with a specific semantics associated. In particular,
$E_{start}$ is the set of start nodes, $E_{end}$ is the set of end nodes,
$A$ is the set of activities, $G$ is the set of gateways, $D$ is the set
of data objects;

• $(S, C)$ is another tuple. Each component of this tuple is a set of edges.
Specifically,
$S \subseteq (E_{start} \times A) \cup (A \times E_{end}) \cup (A \times A) \cup (A \times G) \cup (G \times G) \cup (G \times A)$
is a set of sequences connecting process flow objects.
$C \subseteq (A \times D) \cup (D \times A)$, such that
$\forall d \in D : |\{(\cdot, d) \in C\} \cup \{(d, \cdot) \in C\}| \le 1$,
is a set of associations going from activities to data objects and from
data objects to activities. The additional condition guarantees that
one data object is connected with at most one activity.
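
As an illustration, the tuple-based definition can be encoded with plain
sets. The following Python sketch is not taken from PLG2 (which is a Java
application): the `Process` class and the `is_valid` helper are
hypothetical names, and nodes are assumed to be plain strings, with names
that are distinct across activities and data objects. The sketch checks the
typing constraints on $S$ and $C$, including the rule that a data object is
connected with at most one activity.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """P = ((E_start, E_end, A, G, D), (S, C)); nodes are plain strings (assumption)."""
    e_start: set = field(default_factory=set)
    e_end: set = field(default_factory=set)
    activities: set = field(default_factory=set)     # A
    gateways: set = field(default_factory=set)       # G
    data_objects: set = field(default_factory=set)   # D
    sequences: set = field(default_factory=set)      # S: (source, target) pairs
    associations: set = field(default_factory=set)   # C: (source, target) pairs


def is_valid(p: Process) -> bool:
    a, g, d = p.activities, p.gateways, p.data_objects
    # Allowed (source set, target set) combinations for S, as listed in the definition.
    allowed = [(p.e_start, a), (a, p.e_end), (a, a), (a, g), (g, g), (g, a)]
    for src, dst in p.sequences:
        if not any(src in s and dst in t for s, t in allowed):
            return False
    # C may only link activities to data objects or data objects to activities.
    for src, dst in p.associations:
        if not ((src in a and dst in d) or (src in d and dst in a)):
            return False
    # Each data object takes part in at most one association.
    for obj in d:
        if sum(1 for pair in p.associations if obj in pair) > 1:
            return False
    return True
```

With this encoding, a sequence going from an end event to a gateway, or a
data object associated with two activities, is rejected by `is_valid`.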

Please note that the definition just provided partially enforces the
semantic correctness of each component involved (for example, it is not
possible to connect an end event with a gateway or a data object with an
event).

With respect to the data object associations, the $C$ component of a
process $P$ permits data objects both incoming into and outgoing from
activities. This behavior is described in the UML class diagram, reported
in Fig. 1, with the direction element of a DataObject.

^3 An example of a supported tool is Signavio, http://www.signavio.com
^4 See http://www.pnml.org for more information on this standard.
4 Random Generation of Business Processes in PLG2


The definition of process just described can be used as a general
representation for the description of relations between activities,
events, gateways, and data objects. In this paper, however, we are also
interested in the generation of random process models, in order to be able
to create a "process population" capable of describing several behaviors.

In order to generate random processes, we need to combine some well-known
workflow control-flow patterns. The patterns we are interested in are
reported in this summarized list:

• sequence (WCP-1): direct succession of two activities (i.e., an activity
is enabled after the completion of the preceding one);

• parallel split (WCP-2): parallel execution (i.e., once the work reaches
the split, it is forked into parallel branches, each executing
concurrently);

• synchronization (WCP-3): synchronization of parallel branches (i.e.,
the work is allowed to continue only when all incoming branches are
completed);

• exclusive choice (WCP-4): mutually exclusive execution (i.e., once the
work reaches the split, precisely one of the outgoing branches is allowed
to continue);

• simple merge (WCP-5): convergence of branches (i.e., each incoming
branch results in continuing the work);

• structured loop (WCP-21): ability to execute sub-processes repeatedly.

Clearly, these patterns do not describe all the possible behaviors that can
be modeled in reality; however, we think that most realistic processes are
based on them. Actually, we are also going to extend these patterns (with
the addition of data objects), in order to generate multiperspective models.

The way we use these patterns is by progressively combining them
in order to build a complete process. The combination of such patterns
is performed according to a predefined set of rules. We implement this
idea via a Context-Free Grammar (CFG) whose productions are related
to the patterns mentioned above. Specifically, we defined the following
context-free grammar $G_{process} = \{V, \Sigma, R, P\}$, where
$V = \{P, G, G', G_{\circlearrowleft}, G_{\wedge}, G_{\otimes}, A, A_{act}, A_{do}, D\}$
is the set of the non-terminal symbols,
$\Sigma = \{;, (, ), \circlearrowleft, \wedge, \otimes, \leftarrow, \rightarrow, e_{start}, e_{end}, a, b, c, \dots, d_1, d_2, d_3, \dots\}$
is the set of all terminals (their "interpretation" is described in
Table 1), and $R$ is the set of productions:

$P \to e_{start} ; G ; e_{end}$
$G \to G' \mid G_{\circlearrowleft}$
$G' \to A \mid (G; G) \mid (A; G_{\wedge}; A) \mid (A; G_{\otimes}; A) \mid \varepsilon$
$G_{\wedge} \to G \wedge G \mid G \wedge G_{\wedge}$
$G_{\otimes} \to G \otimes G \mid G \otimes G_{\otimes}$
$G_{\circlearrowleft} \to (G' \circlearrowleft G)$
$A \to A_{act} \mid A_{do}$
$A_{do} \to A_{act}^{\leftarrow D} \mid A_{act}^{\rightarrow D}$
$A_{act} \to a \mid b \mid c \mid \dots$
$D \to d_1 \mid d_2 \mid d_3 \mid \dots$

and $P$ is the starting symbol for the grammar. Using this grammar, a
process is described by a string derived from $G_{process}$.

Symbol                    Meaning
$e_{start}$               The start event of the process
$e_{end}$                 The end event of the process
$($ $)$                   Parentheses are used to describe operator precedence
$;$                       Operator, it indicates a sequential connection
$\circlearrowleft$        Operator, repetition of the first parameter by
                          executing the second
$\wedge$                  Operator, its parameters are executed in parallel ("AND")
$\otimes$                 Operator, its parameters are executed in mutual
                          exclusion ("XOR")
$a^{\leftarrow d}$        Indicates that data object d is required by activity a
$a^{\rightarrow d}$       Indicates that data object d is generated by activity a
$a, b, c, \dots$          The set of possible activity names
$d_1, d_2, d_3, \dots$    A set of possible data objects

Table 1: All the terminal symbols of the context-free grammar used for the
random generation of business processes and their corresponding meanings.

Analyzing the production rules, it is possible to see that each process
requires a start and an end event and, in the middle, there must be
a sub-graph $G$. A sub-graph can be either a "simple sub-graph" ($G'$) or a
"repetition of a sub-graph" ($G_{\circlearrowleft}$).

Starting from the first case: a sub-graph $G'$ can be a single activity $A$;
the sequential execution of two sub-graphs $(G; G)$; the exclusive or
parallel execution of some sub-graphs (respectively, $(A; G_{\otimes}; A)$
and $(A; G_{\wedge}; A)$); or an "empty" sub-graph $\varepsilon$. It is
important to note that the generation of parallel and mutual exclusion
branches is always "well structured".
Analyzing the repetition of a sub-graph ($G_{\circlearrowleft}$) it should
be noticed that, semantically, the repetition of a sub-graph
$(G' \circlearrowleft G)$ is described as follows: each time we want to
repeat the "main" sub-graph $G'$, we have to perform another sub-graph $G$;
the idea is that $G$ (that can even be only a single or empty activity)
corresponds to the "roll-back" activities required in order to prepare the
system for the repetition of $G'$ (which, also, could be an empty
activity).

The structure of $G_{\wedge}$ and $G_{\otimes}$ is simple and expresses the
parallel execution or the choice between at least 2 sub-graphs.

$A$ represents the set of possible activities. In this case two productions
are possible: $A_{act}$, which generates just an activity, or $A_{do}$,
which generates an activity with a data object associated. In this latter
case, two more productions are possible, $A_{act}^{\leftarrow D}$ and
$A_{act}^{\rightarrow D}$: the first generates an activity with a required
data object, the second produces an activity with a generated data object.
Finally, the grammar defines activities just as alphabetic identifiers but,
actually, the implemented tool "decorates" them with other attributes, such
as a unique identifier. The same observation holds for data objects.

Finally, this grammar definition allows for multiple activities with the
same name; however, in our implemented generator all the activities are
considered to be different.

In Fig. 2, an example of all the steps involved in the generation of a
process is shown: the derivation tree, the string of terminals, and two
graphical representations of the final process (using BPMN and Petri net
notations).

4.1 Grammar Capabilities 

The context-free grammar just provided is not capable of generating all
the possible business models that could be described using languages such
as BPMN or Petri nets. In particular, we are restricting ourselves to
block-structured ones [33]. Although restricting to block-structured
processes might seem coarse, these processes benefit from very interesting
properties. Moreover, recently, the process mining community started to
focus on this type of process, especially for the soundness properties
that they can guarantee. With the adoption of the proposed context-free
grammar, we decided to stick to this type of language as well.

Please note that the block structure restriction only affects the random
process generation part of PLG2: all other components (i.e., process
evolution, and simulations for the generation of event logs or streams)
still work with imported (and non-block-structured) processes.

We can also note that there is a straightforward translation of a string
produced by the PLG2 grammar into the graph representation introduced
in the previous sections. Therefore, the processes generated with PLG2 can
always be expressed as BPMN (with all perspectives) or Petri nets (with
just the control-flow perspective).

[Figure 2 panels: (a) Example of derivation tree. Note that, for space
reasons, we omitted the explicit representation of some basic productions.
(b) The string derived from the above tree. (c) BPMN representation,
created by PLG2, for the process generated.]

Figure 2: Derivation tree of a process and its string and BPMN
representation.


4.2 Grammar Extension with Probabilities 


As previously stated, we want to randomly generate strings of terminals
using the context-free grammar described earlier. However, in order to
provide the user with deep control over the final structure of the
generated processes, we converted the CFG into a stochastic context-free
grammar (SCFG). This type of grammar has also been widely used for
modeling RNA structures [17].

Specifically, to adopt this model, we need to associate a probability
with each production rule. This allows us to introduce user-defined
parameters to control the presence of specific patterns in the generated
process. These are the probabilities defined (with an indication of
whether the user is asked to provide the value):


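Probability   Production                                  Value
$\pi_1$       $G \to G_{\circlearrowleft}$                required
$\pi_2$       $G \to G'$                                  as $1 - \pi_1$
$\pi_3$       $G' \to A$                                  required
$\pi_4$       $G' \to (G; G)$                             required
$\pi_5$       $G' \to (A; G_{\wedge}; A)$                 required
$\pi_6$       $G' \to (A; G_{\otimes}; A)$                required
$\pi_7$       $G' \to \varepsilon$                        required
$\pi_8$       $G_{\wedge} \to G \wedge G_{\wedge}$        computed
$\pi_9$       $G_{\wedge} \to G \wedge G$                 as $1 - \pi_8$
$\pi_{10}$    $G_{\otimes} \to G \otimes G_{\otimes}$     computed
$\pi_{11}$    $G_{\otimes} \to G \otimes G$               as $1 - \pi_{10}$
$\pi_{12}$    $A \to A_{do}$                              required
$\pi_{13}$    $A \to A_{act}$                             as $1 - \pi_{12}$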

In order to have a valid grammar, the system has to enforce that the
probabilities of the productions of each non-terminal sum to 1. Let's
define the groups of probabilities as
$G_{pr} = \{\{\pi_1, \pi_2\}, \{\pi_3, \dots, \pi_7\}, \{\pi_8, \pi_9\}, \{\pi_{10}, \pi_{11}\}, \{\pi_{12}, \pi_{13}\}\}$.
Then the following property has to be fulfilled:
$\forall Pr \in G_{pr} : \sum_{p \in Pr} p = 1$.
This property holds by construction for $\{\pi_1, \pi_2\}$ and for
$\{\pi_{12}, \pi_{13}\}$. For $\{\pi_3, \dots, \pi_7\}$ it is artificially
enforced (the user is required to insert weights, which are then
proportionally adapted in order to sum up to 1).
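
The proportional adaptation of the user weights can be sketched as follows,
in Python. This is only an illustration: the function name `normalize` and
the dictionary keys are assumptions, not part of the PLG2 API.

```python
def normalize(weights):
    """Scale non-negative weights proportionally so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {rule: w / total for rule, w in weights.items()}


# Hypothetical user weights for the five productions of G' (pi_3 ... pi_7).
probabilities = normalize({"A": 2, "seq": 1, "and": 1, "xor": 0, "skip": 0})
```

Here the weight 2 given to the single-activity production becomes a
probability of 0.5, since the weights add up to 4.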

The two remaining sets (i.e., $\{\pi_8, \pi_9\}$ and
$\{\pi_{10}, \pi_{11}\}$) are treated slightly differently: in this case
the user is required to insert the maximum number of possible AND/XOR
branches. This information lets us dynamically compute the probability
values. Let's consider the AND case: in the beginning,
$\pi_8 = \pi_9 = 0.5$. The system keeps these values unchanged until the
maximum number of AND branches has been generated (i.e., the number of
times that the production rule $G_{\wedge} \to G \wedge G_{\wedge}$ is
consecutively executed). Once the maximum value is reached, the
probabilities are changed in order to stop generating more branches:
$\pi_8 = 0$ (and therefore $\pi_9 = 1$). A similar approach is adopted for
the XOR branches (i.e., $\{\pi_{10}, \pi_{11}\}$). Although this adaptation
strains the context-free property of the grammar, we think that, for the
final user, it is much easier to specify the maximum number of branches
instead of the actual probabilities.

In order to provide the user with more detailed control of the grammar,
we require an additional parameter, called maximum depth. This parameter
allows the user to control the depth of the derivation tree: once the tree
reaches the maximum depth, the probabilities are artificially changed to
these values: $\pi_3 = \pi_7 = 0.5$,
$\pi_4 = \dots = \pi_6 = \pi_8 = \pi_{10} = 0$. This change of
probabilities forces the derivation tree to limit its depth by allowing
only new activity or skip patterns.
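
To make the interplay between probabilities, maximum number of branches and
maximum depth more concrete, the following Python sketch derives a random
string from the grammar. It is a simplified illustration, not the PLG2
implementation: the function name, its parameters and the textual operators
(`AND`, `XOR`, `LOOP`, `eps`, and `<-`/`->` for required/generated data
objects) are assumptions. Two further assumptions are made: the two
productions of $A_{do}$ are chosen uniformly, and loops are also disabled
once the maximum depth is reached, so that the derivation always terminates.

```python
import itertools
import random


def generate_process(pi_loop, weights, pi_data, max_branches, max_depth, seed=None):
    """Derive a random process string; all names here are illustrative."""
    rng = random.Random(seed)
    act_ids, data_ids = itertools.count(1), itertools.count(1)

    def activity():                       # A -> A_act | A_do
        name = f"a{next(act_ids)}"
        if rng.random() < pi_data:        # A_do: required (<-) or generated (->)
            arrow = rng.choice(["<-", "->"])
            return f"{name}{arrow}d{next(data_ids)}"
        return name

    def branches(op, depth):              # G_op -> G op G | G op G_op
        parts = [graph(depth), graph(depth)]
        # Probability 0.5 of one more branch, until the maximum is reached.
        while len(parts) < max_branches and rng.random() < 0.5:
            parts.append(graph(depth))
        return f" {op} ".join(parts)

    def simple(depth):                    # G'
        if depth >= max_depth:            # only new activity or skip
            rules, probs = ["A", "skip"], [0.5, 0.5]
        else:
            rules, probs = list(weights), list(weights.values())
        rule = rng.choices(rules, weights=probs)[0]
        if rule == "A":
            return activity()
        if rule == "seq":
            return f"({graph(depth + 1)}; {graph(depth + 1)})"
        if rule in ("and", "xor"):
            return f"({activity()}; {branches(rule.upper(), depth + 1)}; {activity()})"
        return "eps"

    def graph(depth):                     # G -> G' | G_loop
        if depth < max_depth and rng.random() < pi_loop:
            return f"({simple(depth + 1)} LOOP {graph(depth + 1)})"
        return simple(depth)

    return f"e_start; {graph(0)}; e_end"
```

The `weights` dictionary is expected to use the keys `A`, `seq`, `and`,
`xor` and `skip` (one per production of $G'$); since `random.choices`
accepts relative weights, they do not need to be normalized beforehand.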

5 Process Simulation in PLG2 

In order to evaluate process mining algorithms or, in general, to stress
business intelligence systems, we are not only interested in the random
generation of processes, but we also need observations of the activities
executed for each process instance, i.e., event logs. This section reports
details on how we generate multiperspective logs and how to make them more
realistic by artificially inserting some noise.

Before getting into the actual simulation algorithms, it is important to
define the concept of event log. In order to better understand how an event
log is composed, we clarify that an execution of a business process forms a
case. The sequence of events in a case is called a trace, and each trace,
in turn, consists of the list of events which refer to specific activities
performed. It is possible to see each event as a set of attributes (i.e.,
key-value pairs). The fundamental attributes of an event are: (i) the name
of the executed activity, (ii) the timestamp (which reports the execution
time of the given event) and (iii) the activity lifecycle (whether the
event refers to the beginning or to the completion of an activity). The
lifecycle attribute is important when a recorded activity lasted for a
certain amount of time (i.e., it is not instantaneous): in this case, two
events are recorded, one when the activity begins and another when the
activity ends.

More formally, given the set of all possible activity names
$\mathcal{A}$, the set of all possible case identifiers $\mathcal{C}$, the
set of timestamps $\mathcal{T}$, and the set of lifecycle transitions
$\mathcal{L} = \{start, complete\}$, it is possible to define an event $e$
as a tuple, such as
$e = (c, a, t, l) \in \mathcal{C} \times \mathcal{A} \times \mathcal{T} \times \mathcal{L}$.
In this case, it describes the occurrence of activity $a$, with lifecycle
transition $l$, for the case $c$, at time $t$. Please note that the
attributes reported here are just the minimum required ones: other
attributes can be added to the event (in general, each data object of the
process will generate a new attribute). Given an event
$e = (c, a, t, l, a_1, \dots, a_k)$, it is possible to extract each field
using a projection operator: $\#_{case}(e) = c$; $\#_{activity}(e) = a$;
$\#_{time}(e) = t$; $\#_{lifecycle}(e) = l$, and so on.

Given a finite set $\mathbb{N}^+ = \{1, 2, \dots, n\}$ and a "target" set
$A$, we define a sequence $\sigma$ as a function
$\sigma : \mathbb{N}^+ \to A$. We say that $\sigma$ maps indexes to the
corresponding elements in $A$. For simplicity, we refer to a sequence
using its "string" interpretation:
$\sigma = \langle s_1, \dots, s_n \rangle$, where $s_i = \sigma(i)$ and
$s_i \in A$. Moreover, we assume to have concatenation and cardinality
operators: respectively
$\langle e_1^1, \dots, e_n^1 \rangle \cdot \langle e_1^2, \dots, e_m^2 \rangle = \langle e_1^1, \dots, e_n^1, e_1^2, \dots, e_m^2 \rangle$
and $|\langle e_1, \dots, e_n \rangle| = n$.

In our context, we use timestamps to sort the events. Therefore, it is safe
to consider a trace just as a sequence of events. In turn, a log is just a
set of traces. Therefore, traces are allowed to overlap: given a log $L$
with two traces $t_1 = \langle e_1^1, \dots, e_n^1 \rangle \in L$ and
$t_2 = \langle e_1^2, \dots, e_m^2 \rangle \in L$, it is possible to have
that $\#_{time}(e_1^2) \le \#_{time}(e_n^1)$ and
$\#_{time}(e_1^1) \le \#_{time}(e_1^2) \le \#_{time}(e_m^2)$.
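
The notions of event, trace and overlapping traces can be illustrated with
a few lines of Python. This is a minimal sketch under stated assumptions:
the `Event` tuple, the `trace_of` and `overlap` helpers are hypothetical
names, timestamps are plain numbers, and the projection operators
correspond to simple field accesses (e.g., `e.time` for the time
projection).

```python
from collections import namedtuple

# An event as the tuple (case, activity, time, lifecycle).
Event = namedtuple("Event", ["case", "activity", "time", "lifecycle"])


def trace_of(events, case):
    """Return the events of one case, sorted by timestamp."""
    return sorted((e for e in events if e.case == case), key=lambda e: e.time)


def overlap(t1, t2):
    """True when the time spans of two non-empty traces intersect."""
    return t1[0].time <= t2[-1].time and t2[0].time <= t1[-1].time


events = [
    Event("c1", "a", 0, "start"),
    Event("c2", "a", 3, "start"),
    Event("c1", "a", 5, "complete"),
    Event("c2", "b", 9, "complete"),
    Event("c3", "a", 20, "start"),
]
```

In this example the traces of cases `c1` and `c2` overlap (the second case
starts at time 3, before the first one completes at time 5), whereas the
trace of `c3` does not overlap with `c1`.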


5.1 Multi-Perspective Simulation 

The procedure for the generation of logs out of business processes
basically consists of a simulation engine running a "play-out activity"
[34]. However, in order to properly simulate all the perspectives required,
some conventions need to be defined.

The structure of process models that PLG2 can handle is restricted to
the family of BPMN models with unambiguous semantics. Therefore, in
PLG2, it is possible to consider a process as its equivalent Petri net
representation [27]. The main advantage, in this case, is that it is
possible to play the token game for simulating the process.

The procedures for the simulation of a process instance are reported in
Algorithms 1, 2 and 3. These procedures use the following additional
functions: in, out and rnd. Let's assume a process
$P = ((E_{start}, E_{end}, A, G, D), (S, C))$, as described in Section 3.2.
Given $c \in A \cup G$, we can define
$in(c) = \{c' \mid (c', c) \in S\}$ and
$out(c) = \{c' \mid (c, c') \in S\}$. $rnd(s)$, instead, given a
general set $s$, returns a randomly selected element $e$ such that
$e \in s$.
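
These three helper functions translate almost literally into code. The
following Python sketch assumes that $S$ is a set of (source, target)
pairs; the names `incoming` and `outgoing` are used because `in` is a
reserved word in Python, and the optional `rng` parameter is an addition
made here only to allow reproducible choices.

```python
import random


def incoming(S, c):
    """in(c): components with a sequence leading to c."""
    return {src for (src, dst) in S if dst == c}


def outgoing(S, c):
    """out(c): components reachable from c through one sequence."""
    return {dst for (src, dst) in S if src == c}


def rnd(s, rng=random):
    """Return a randomly selected element of the non-empty set s."""
    return rng.choice(sorted(s))  # sorted() gives a stable, indexable sequence
```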

Algorithm 1 represents the main entry point of the simulation: it
expects, as input, a process model and the number of traces to simulate.
Then, it basically iterates the generation of single traces in order to
populate the log. The sort step is required in order to properly order the
events, and the step that follows it introduces, if required, some noise
into the trace. The noise generation will be described later in this
section.

Algorithm 1: A general simulation procedure

Input : $P = ((E_{start}, E_{end}, A, G, D), (S, C))$: the process to simulate
        tot: the number of traces to generate
Output: An event log

log ← ∅
for i = 1 up to tot do
    t ← ⟨⟩                                    ▷ Generate a new trace
    SimulateProcess(P, t, rnd(E_start), ⊥)    ▷ Algorithm 2
    sort(t)                                   ▷ Sort events w.r.t. their times
    add trace-level noise to t                ▷ See Section 5
    log ← log ∪ {t}                           ▷ Add the trace to the log
end
return log

Algorithm 2 is in charge of the control-flow simulation. The algorithm
expects as input the process to simulate, the component to analyze, and the
sequence (i.e., the edge) that brought the analysis to the current
component. First of all, the algorithm requires the definition of a global
set of tokens. This set is fundamental for the "token game": making an
analogy with Petri nets, it stores the current marking (i.e., the tokens
configuration). In our case, however, the set contains the edges that are
"allowed to execute".

The idea behind Algorithm 2 is to call itself on all elements (events,
tasks, and gateways) of the process. Different behaviors are then
performed, based on the analyzed element. Specifically, if the element is a
task, it is simulated (line 4), and then the algorithm is called on the
following component (line 10). If the analyzed element is a XOR gateway,
the call is just passed to one (randomly picked) outgoing element (line 8).
If the currently analyzed element is an AND gateway, we assume that it can
be either a split or a join (not both at the same time). It is possible to
discriminate between split and join by checking the number of outgoing
edges (lines 13 and 21). If the gateway is an AND split, it is necessary to
make one call for each AND branch (line 19). If the gateway is a join, it
is necessary to check whether all the incoming branches have terminated
(lines 22-28). If this is the case, then the flow is allowed to continue
with the following activities (line 35). Please note that we omitted here
the description of the token handling (i.e., insertion, check, and removal)
for readability purposes: it is managed in a standard way.
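
The recursion described above can be illustrated with a minimal Python sketch, restricted to the control-flow. The dictionary-based process encoding, the element names, and the `simulate_process` helper are assumptions made for this example and are not PLG2's internal data structures. Start and end events are treated as simple pass-through elements, which is also an assumption of the sketch.

```python
import random

# Hypothetical toy process:
# start -> A -> AND split -> (B || C) -> AND join -> D -> XOR -> (E | F) -> end
KIND = {"start": "event", "A": "task", "split": "and", "B": "task",
        "C": "task", "join": "and", "D": "task", "xor": "xor",
        "E": "task", "F": "task", "end": "event"}
OUT = {"start": ["A"], "A": ["split"], "split": ["B", "C"], "B": ["join"],
       "C": ["join"], "join": ["D"], "D": ["xor"], "xor": ["E", "F"],
       "E": ["end"], "F": ["end"], "end": []}
IN = {n: [p for p in OUT if n in OUT[p]] for n in OUT}


def simulate_process(trace, tokens, c, s):
    """Recursive walk following the structure of Algorithm 2."""
    kind = KIND[c]
    if kind in ("task", "xor"):
        if kind == "task":
            trace.append(c)               # stand-in for SimulateActivity
            tokens.discard(s)
        if OUT[c]:
            n = random.choice(OUT[c])     # randomly select the next component
            tokens.add((c, n))
            simulate_process(trace, tokens, n, (c, n))
    elif kind == "and":
        if len(OUT[c]) > 1:               # the gateway is a split
            tokens.discard(s)
            for n in OUT[c]:
                tokens.add((c, n))
            for n in OUT[c]:
                simulate_process(trace, tokens, n, (c, n))
        elif all((p, c) in tokens for p in IN[c]):   # join with all branches seen
            for p in IN[c]:
                tokens.discard((p, c))
            n = OUT[c][0]
            tokens.add((c, n))
            simulate_process(trace, tokens, n, (c, n))
    else:                                 # start/end event: pass through (assumption)
        for n in OUT[c]:
            tokens.add((c, n))
            simulate_process(trace, tokens, n, (c, n))


trace, tokens = [], set()
simulate_process(trace, tokens, "start", None)
print(trace)
```

In this sketch the join is reached twice: the first visit (coming from B) finds the token of C missing and stops, while the second visit consumes both tokens and lets the flow continue with D.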

Algorithm 3 is responsible for adding an activity to the provided trace.
The algorithm first creates the start event for the activity and populates
it with the standard fields (lines 1-4). If the activity has a
non-instantaneous duration, the algorithm also creates a complete event
(lines 17-20).

In order to determine the activity time and its duration, the system 
needs to check whether the user specified any of these parameters. If no 

Algorithm 2: Simulate Process

Input : P = ((E_start, E_end, A, G, D), (S, C)): the process to simulate
        t: the trace containing the simulated events
        c: process component to simulate
        s = (i, c): incoming sequence

 1  tokens ← globally defined set of tokens (i.e., sequences), initially
             the empty set
 2  if c is a Task or c is a XOR gateway then
 3      if c is a Task then
 4          SimulateActivity(P, t, c)                ▷ Algorithm 3
 5          tokens ← tokens \ {s}
 6      end
 7      if |out(c)| ≥ 1 then
 8          n ← rnd(out(c))      ▷ Randomly select the following component
 9          tokens ← tokens ∪ {(c, n)}               ▷ Update tokens
10          SimulateProcess(P, t, n, (c, n))         ▷ Recursion
11      end
12  else if c is an AND gateway then
13      if |out(c)| > 1 then                         ▷ We treat c as a split
14          tokens ← tokens \ {s}
15          forall the n ∈ out(c) do
16              tokens ← tokens ∪ {(c, n)}           ▷ Add all tokens
17          end
18          forall the n ∈ out(c) do
19              SimulateProcess(P, t, n, (c, n))     ▷ Recursive call
20          end
21      else                        ▷ In this case, we treat c as a join
22          allBranchesSeen ← true
            ▷ Check whether all branches (i.e., incoming edges) have
            ▷ been executed
23          forall the p ∈ in(c) do
24              if (p, c) ∉ tokens then
25                  allBranchesSeen ← false
26                  break
27              end
28          end
29          if allBranchesSeen is true then
30              forall the p ∈ in(c) do
31                  tokens ← tokens \ {(p, c)}       ▷ Remove tokens
32              end
33              n ← out(c)                           ▷ Get the outgoing edge
34              tokens ← tokens ∪ {(c, n)}           ▷ Update tokens
35              SimulateProcess(P, t, n, (c, n))     ▷ Recursive call
36          end
37      end
38  end

Algorithm 3: Simulate Activity

Input : P = ((E_start, E_end, A, G, D), (S, C)): the process
        t: the trace that contains the new events
        a: activity to simulate

    ▷ Generate the activity start event
 1  e_start ← new event referring to activity a
 2  #activity(e_start) ← the name of activity a
 3  #time(e_start) ← activity time                   ▷ Details in text
 4  #lifecycle(e_start) ← start
    ▷ Decorate with all generated data objects
 5  forall the d ∈ {d | (a, d) ∈ C} do
 6      #d(e_start) ← value generated for d
 7  end
    ▷ Decorate with all required data objects
 8  if |t| ≥ 1 then
 9      forall the d ∈ {d | (d, a) ∈ C} do
10          lastEvent ← t(|t| − 1)
11          #d(lastEvent) ← value generated for d
12      end
13  end
14  add event-level noise to e_start                 ▷ See Section 5.2
15  t ← t · ⟨e_start⟩
    ▷ Generate the activity completion event
16  if activity a is not instantaneous then
17      e_complete ← new event referring to activity a
18      #activity(e_complete) ← the name of activity a
19      #time(e_complete) ← #time(e_start) + activity duration
20      #lifecycle(e_complete) ← complete
        ▷ Decorate with all generated data objects
21      forall the d ∈ {d | (a, d) ∈ C} do
22          #d(e_complete) ← value generated for d
23      end
24      add event-level noise to e_complete          ▷ See Section 5.2
25      t ← t · ⟨e_complete⟩
26  end

specifications are reported, then the activity is assumed to be
instantaneous and to execute a fixed amount of time after the previous one.
However, as said, the user can manually specify these parameters. To do so,
the user has to provide two Python functions: time_after(caseid) and
time_lasted(caseid). Both these functions are called by the simulator with
the caseid parameter set to the actual case id: this allows the two
functions to be case-dependent (for example, it is possible to save files
with contextual information). Specifically, time_after(caseid) is required
to return the number of seconds that have to be (virtually) waited before
the following activity is allowed to start; time_lasted(caseid), instead,
has to return the number of seconds that the activity is supposed to last.
This approach is extremely flexible, and allows the user to set up very
complex simulations. For example, it is possible to define different
durations for the same activity depending on which flow the current trace
has followed so far, or with respect to the number of iterations on a loop.
Examples of such functions are reported in Listing 1.

Listing 1: Example of random activity duration (between 5 and 15 minutes)
and random time after the execution of an activity (between 1 and 5
minutes).

    from random import randint

    # This Python script is called for the generation of the time-related
    # features of the activity. Note that the function parameter is the
    # actual case id of the ongoing simulation (you can use this value to
    # customize the behavior according to the actual instance).

    # The time_after(caseid) function returns the number of seconds to
    # wait before the following activity can start.
    def time_after(caseid):
        return randint(60*1, 60*5)

    # The time_lasted(caseid) function returns the number of seconds
    # the activity is supposed to last.
    def time_lasted(caseid):
        return randint(60*5, 60*15)
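
As a further illustration of the case-dependent behavior mentioned above, the following sketch makes the duration depend on how many times the activity has already been executed within the same case (e.g., inside a loop). The `_executions` counter and the halving rule are hypothetical choices made for this example. The sketch also assumes that module-level state survives between successive calls, which depends on how the script engine is configured.

```python
from collections import defaultdict
from random import randint

# Hypothetical per-case memory: number of executions of this activity
# observed so far in each process instance.
_executions = defaultdict(int)


def time_after(caseid):
    # Wait between 1 and 5 minutes before the following activity starts.
    return randint(60 * 1, 60 * 5)


def time_lasted(caseid):
    # The first execution in a case lasts 5-15 minutes; the bounds are
    # halved at every repetition, but never drop below one minute.
    _executions[caseid] += 1
    shrink = 2 ** (_executions[caseid] - 1)
    low = max(60, (60 * 5) // shrink)
    high = max(60, (60 * 15) // shrink)
    return randint(low, high)
```

With these functions, the second iteration of a loop in a given case lasts between 150 and 450 seconds, while a different case starts again from the full 5-15 minute range.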

Once the time-related properties of an activity are computed, Algorithm 3
has to deal with the data objects associated with the current activity. In
particular, generated data objects (see Section 3) are supposed to generate
values written as attributes of the current activity. Required data
objects, instead, are written as attributes of the activity which precedes
the current one in the trace. The rationale behind this decision is that
generated data objects are assumed to be values written as output of the
current activity. Required data objects, instead, are variables that have
to be observed prior to the execution of the current activity. However,
since the simulation is driven by the control-flow, it is necessary to
adjust the variable values a posteriori.

In order to better understand the utility of required data objects, let's
consider the process fragment reported in Figure 3. In this case, the
simulation will first perform "Activity A" and then either "Activity B" or
"Activity C". However, every time the simulation engine generates "Activity
B", it also decorates the event referring to "Activity A" (belonging to the
same trace) with d = 1. Instead, every time the simulation engine generates
"Activity C", it also decorates the event referring to "Activity A"
(belonging to the same trace) with d = 2. Therefore, an analysis system fed
with such example traces could infer a correlation between the value of the
attribute d of "Activity A" and the following activity.

Figure 3: A process fragment of a XOR split gateway with two branches,
each of them starting with a different required data object.

Listing 2: Example of script for the generation of random integer values
(in the range [0, 1000]).

    from random import randint

    # This Python script is called for the generation of the integer
    # data object. Note that the parameter of this function is the
    # actual case id of the ongoing simulation (you can use this
    # value to customize your data object). The function name has to
    # be "generate".
    def generate(caseId):
        return randint(0, 1000)
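
The a posteriori decoration described for the XOR fragment can be sketched as follows. The `REQUIRED` table, the dictionary-based events, and both helper functions are hypothetical and serve only to illustrate the mechanism; they do not reproduce PLG2's classes.

```python
import random

# Hypothetical encoding of the fragment: each branch after "Activity A"
# requires the data object d with a different value.
REQUIRED = {"Activity B": {"d": 1}, "Activity C": {"d": 2}}


def simulate_activity(trace, activity):
    """Append an event; required data objects go on the preceding event."""
    required = REQUIRED.get(activity, {})
    if trace and required:
        trace[-1].update(required)       # a posteriori decoration
    trace.append({"activity": activity})


def simulate_fragment():
    trace = []
    simulate_activity(trace, "Activity A")
    simulate_activity(trace, random.choice(["Activity B", "Activity C"]))
    return trace


print(simulate_fragment())
```

In every generated trace the value of d stored on "Activity A" determines which activity follows, which is exactly the correlation a data-aware mining algorithm is expected to rediscover.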

From the characterization of data objects given earlier, it is possible to
distinguish two types of data objects: plain data objects and dynamic data
objects. This distinction is required by the simulation engine in order to
properly deal with them: plain data objects are treated as fixed values
(i.e., the simulation always generates the same value); dynamic data
objects are actually Python scripts whose values are determined by the
execution of the script itself. These scripts must implement a
generate(caseId) function, which is supposed to return either an integer or
a string value (depending on the type of data object). An example of an
integer dynamic data object script is reported in Listing 2. Please note
that, also in this case, the function is called with the caseId parameter
set to the actual instance's case id, providing the user with in-depth,
case-dependent control over the generated values. The ScriptExecutor
component (and its subclasses), reported in an earlier figure, is in charge
of the execution of the Python scripts.

There is no particular limit on the number of plain and dynamic data 
objects that a task can have, both required and generated. Clearly, the 
higher the number of data objects to generate, the longer the simulation 
will take. 

The current random process generator component is only able to gen¬ 
erate plain data objects. Specifically, the generated data objects are named 
variable_a, variable_b, ... and the values they return are just random 
strings. 

Considering all the simulation aspects described in this section, we can
conclude that our approach is able to simulate multiperspective models in
order to generate multiperspective logs. The term "multiperspective", in
this context, means that the generated data do not only refer to the
control-flow perspective, but also have detailed timing properties, and
that the generated data can be extremely articulated and tailored to the
actual simulation scenario.


5.2 Noise Addition 


In order to generate more realistic data, we introduced a noise component. 

The noise component plays a role after the process has been simulated and
a trace is available. Specifically, this trace is fed to the noise
component, which can apply noise at three different "levels": (i) at the
trace level (i.e., noise which involves the trace organization); (ii) at
the event level (i.e., noise which involves events on the control-flow
perspective); (iii) at the data object level (i.e., noise which involves
the data perspective associated with events). The actual noise generation
is driven by the parameters set by the user. Such parameters, basically,
indicate the probability of applying a particular noise type to the trace.
Setting all these values to zero implies having traces with no noise.

The noise details for the trace and -partially- for the event level have 
already been discussed in the literature and reported in details in [11 T^. 


The idea is that the user has to specify the probability of all the different 
noise events, and the simulator will apply the corresponding effect. Possi¬ 
ble trace-level noise phenomena are: 


• a trace which is missing its head (i.e., its first events). In this case the 
user has also to specify the maximum size for a head (which will be 
randomly chosen between 1 and the provided value); 

• a trace which is missing its tail (i.e., its last events). In this case the 
user has also to specify the maximum size for the tail (which will be 
randomly chosen between 1 and the provided value); 
• a trace which is missing an episode (i.e., a sequence of contiguous
events). In this case the user has also to specify the maximum size 
for an episode (which will be randomly chosen between 1 and the 
provided value); 

• an alien event introduced into the trace, in a random position, with 
random attributes; 

• a doubled event on the trace. 

Possible noise effects at the event level are: 

• the random change of the activity name of an event; 

• the perturbed order between two events of a trace (since the times¬ 
tamp attributes of two events are involved, we consider this as an 
event-level noise). 

Finally, possible data object-level noises are: 

• random modification of an integer dynamic data object. In this case
the user has also to specify the maximum value $\Delta$ of the change:
given the old value $v$, the new one (i.e., after noise) will be
$v + \delta$, with $\delta$ random in the closed interval
$[-\Delta, +\Delta]$;

• random modification of a string dynamic data object (replacement of 
the current string with a randomly generated new one). 

In order to simplify the noise configuration, we already defined some 
basic "noise profiles", such as: (i) complete noise; (ii) noise only on the 
control-flow; (iii) noise only for data-objects; (iv) noise only on activity
names; (v) no noise at all. 
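
A few of the noise types listed above can be sketched in Python as follows. The function names, the parameter names, and the representation of a trace as a plain list are assumptions of this example; the sketch only shows how a user-provided probability and a maximum size (or maximum change) drive the perturbation.

```python
import random


def missing_head(trace, probability, max_size, rng=random):
    """With the given probability, drop between 1 and max_size leading events."""
    if rng.random() < probability:
        return trace[rng.randint(1, max_size):]
    return list(trace)


def missing_tail(trace, probability, max_size, rng=random):
    """With the given probability, drop between 1 and max_size trailing events."""
    if rng.random() < probability:
        cut = rng.randint(1, max_size)
        return trace[:max(0, len(trace) - cut)]
    return list(trace)


def perturb_integer(value, probability, max_delta, rng=random):
    """With the given probability, add a random change in [-max_delta, +max_delta]."""
    if rng.random() < probability:
        return value + rng.randint(-max_delta, max_delta)
    return value
```

Setting every probability to zero leaves traces and values untouched, which corresponds to the "no noise at all" profile; a `random.Random` instance with a fixed seed can be passed as `rng` to make an experiment repeatable.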

6 Stream Simulation in PLG2 

As stated before, PLG2 was explicitly designed for the simulation of online
event streams. Specifically, in this context, we adopted the definition of 
stream already used in the process mining community |[^[^. 

6.1 Continuous Data Generation 

An event stream differs from an event log in two fundamental aspects. First
of all, an event stream does not have a predefined end (i.e., the user can
generate as many events as desired, so the simulation can last for an
unspecified amount of time). The second distinction consists in keeping the
events sorted by their time, and not grouped into traces.
Figure 4: Graphical representation of an event stream. Boxes represent
events: their background colors represent the case id, and the letters in¬ 
side are the activity names. First line reports the stream, following lines are 
the single cases. 


Therefore, differently from an event log (which is a set of sequences,
i.e., the traces), an event stream is just a sequence of events. The only
property that must be enforced is that, given an event stream $\sigma$, for
all available indexes $i$,
$\#_{time}(\sigma(i)) \le \#_{time}(\sigma(i+1))$. Instead, it will happen,
for some indexes $i$, that $\#_{case}(\sigma(i)) \ne \#_{case}(\sigma(i+1))$
(i.e., contiguous events refer to different process instances). In this
last case, the two events are said to belong to interleaving traces.
Figure 4 reports a graphical representation of three interleaving traces
and what the actual stream looks like.
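
The two properties above (events sorted by time, contiguous events possibly belonging to different cases) can be checked on a small example. The three traces and the tuple layout `(time, case id, activity)` are made up for this sketch.

```python
import heapq

# Three hypothetical traces; each event is (time, case id, activity name).
traces = [
    [(0, "c1", "A"), (4, "c1", "B"), (9, "c1", "C")],
    [(1, "c2", "A"), (3, "c2", "C")],
    [(2, "c3", "A"), (5, "c3", "B")],
]

# heapq.merge lazily merges already-sorted inputs into one sorted sequence.
stream = list(heapq.merge(*traces))

# Timestamps never decrease along the stream.
is_sorted = all(stream[i][0] <= stream[i + 1][0]
                for i in range(len(stream) - 1))

# Some contiguous events refer to different cases (interleaving traces).
interleaved = any(stream[i][1] != stream[i + 1][1]
                  for i in range(len(stream) - 1))

print(stream)
print(is_sorted, interleaved)
```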

From an implementation point of view, the idea is to create a socket, 
which is accepting connections from external clients. PLG2, then, "emits" 
(i.e., writes on the socket) events that are generated. The challenge, in this 
case, is to let the system simulate and send events for a potentially infinite 
amount of time. 

In order to generate our continuous stream, we need to ask the user for 
two parameters: the maximum number of parallel instances running at the 
same time, and the "time scale". The first parameter is used to populate the 
data structures required for the generation of the stream. Then, since the 
event emission is performed in "the real time" (opposed to the "simulated 
time"), it might be necessary to scale the simulation time in order to have 
the desired events emission rate. To this end we need a time multiplier, 
which is expected to be defined in $(0, \infty)$. This time multiplier is used to
transform the duration of a trace (and the time position of all the contained 
events), from the simulation time to the real time. 

The procedure for the generation of streams is reported in Algorithm 4.
It starts by allocating as many priority queues as the number of parallel
instances of the stream (line 3). These queues are basically used as an
event buffer. Then, the procedure starts a potentially infinite loop for
the event streaming. At the beginning of this loop, the algorithm first
needs to check whether the buffer contains enough events. If this is not
the case (line 7), then a new process instance is simulated (line 9) using
the simulation procedure of Section 5 and, after applying the time scale
(line 10), all its events are added to the event buffer (line 11). Events
are enqueued considering their time order (i.e., events with lower
timestamps have higher priority).⁵ Once the algorithm is sure about the
availability of events, it extracts (and removes) from the buffer the event
with the highest priority (line 13). At this point, it is necessary to map
the simulation time onto the real time: the algorithm has to wait for a
certain amount of time, in order to ensure the correct event distribution
in real time (line 16). After such wait, the event can finally be emitted
(line 20), and all connected clients are notified.
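
The streaming loop just described can be sketched in Python as follows. This is a simplified illustration and not PLG2's implementation: the `stream` function, its parameters, the `(time, activity)` event layout, and the rule used to position a new trace in time are all assumptions. The loop is bounded by `max_events` so that the example terminates, and the waiting function is injected so that it can be replaced by `time.sleep` in a real setting.

```python
from collections import deque


def stream(simulate_trace, p, m, emit, sleep, max_events):
    """Emit max_events events, loosely following the stream procedure.

    simulate_trace() must return a non-empty, time-sorted list of
    (time, activity) pairs in simulated time, starting at 0.
    m is the multiplier mapping simulated time to real time.
    """
    queues = [deque() for _ in range(p)]  # event buffer, one queue per instance
    ends = [0.0] * p                      # time of the last event put in each queue
    last_time = None
    emitted = 0
    while emitted < max_events:           # the original loop runs forever
        # Populate the event buffer when it runs low.
        if sum(len(q) for q in queues) < 2 * p:
            trace = simulate_trace()
            i = min(range(p), key=lambda k: ends[k])   # queue that frees up first
            # Never schedule a new trace before an event already emitted.
            offset = max(ends[i], last_time if last_time is not None else 0.0)
            for ts, activity in trace:
                queues[i].append((offset + ts * m, activity))
            ends[i] = queues[i][-1][0]
        # Extract the buffered event with the lowest timestamp.
        i = min((k for k in range(p) if queues[k]),
                key=lambda k: queues[k][0][0])
        ts, activity = queues[i].popleft()
        if last_time is not None:
            sleep(ts - last_time)         # wait in real time
        last_time = ts
        emit((ts, activity))
        emitted += 1


events = []
stream(lambda: [(0, "A"), (10, "B"), (20, "C")], p=2, m=0.1,
       emit=events.append, sleep=lambda w: None, max_events=5)
print(events)
```

Calling `simulate_trace` every time the buffer is refilled is what makes concept drifts possible: if the callback starts returning traces of a different process, the emitted stream changes without being interrupted.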

Please note that, every time the algorithm has to repopulate the buffer, it
asks the framework for the process which has to be simulated (line 8). This
is a fundamental point: the user can change the process for the simulation, 
without stopping the current stream emission, and if such change occurs, a 
concept drift will be observed. Concept drifts p^l8|25p^ represent another 
important characteristic, which fundamentally differentiates event streams
from event logs and, therefore, identifies a requirement.

Please note also that, in order to have a more accurate mapping between
simulation and real time, the implementation of the buffer population
procedure (lines 7-12 of Algorithm 4) can be executed in an external
thread.⁶

In order to assess the feasibility of this algorithm we ran several
experiments. In particular, we generated the process model reported in
Fig. 5a. This process contains 10 activities, one parallel execution, one
loop and one generated data object. Then, we ran the streamer of this
process for two hours, generating, in total, 4174 different traces and
67 856 events. The average throughput of the streamer, after an initial
configuration stage, settled at 9.4 events per second. Figure 5b reports
the memory requirement of the approach: the evolution of the buffer size
and the total number of events sent are plotted against the running time of
the actual stream. As the plot shows, the average number of stored events
is between 300 and 400, which represents an affordable memory requirement
for any hardware configuration available nowadays.

⁵ Implementation details are skipped here, but some time manipulations are
required in order to insert the new trace after all events already enqueued
while keeping a certain amount of time from the last event.

⁶ This cannot ensure a completely correct mapping; however, the difference
has empirically been found to be negligible.

Algorithm 4: Stream

Input : p: the number of parallel instances
        m: time multiplier

    ▷ Initialization of the data structures
 1  queues ← ∅                               ▷ This is an event buffer
 2  for i = 1 up to p do
 3      queues ← queues ∪ {a new priority queue}
 4  end
 5  l ← ⊥
 6  forever do
        ▷ Populate the event buffer
 7      if |queues| < 2p then
            ▷ Here |queues| is the sum of sizes of all queues contained.
            ▷ Although the inequality could be < 1, we prefer to use 2p
            ▷ since these operations could be performed in a different
            ▷ concurrent thread
 8          proc ← the process to simulate
 9          t ← simulate a new trace for proc  ▷ Simulation procedure, Section 5
10          scale the trace duration (and events times) according to m
11          distribute the events of t (sorted by their time) to the queue
            with the highest priority of the last event
12      end
        ▷ The actual streaming
13      e ← extract (and remove) the event with highest priority from all
            queues                           ▷ From queues
14      if l ≠ ⊥ then
15          w ← #time(e) − #time(l)
16          wait for w time units
17      end
18      l ← e
19      #time(e) ← now
20      emit e                               ▷ To all connected clients
21  end

6.2 Concept Drifts for Process Models

One common characteristic of online settings is the presence of concept
drifts. As described in the previous section, the tool is able to
dynamically switch the source generating the events. However, in order to
change the stream source, a new model is required. To create another model,
two options are available: one is to load or generate a model from scratch;
the other is to "evolve" an existing one, which is an important feature of
PLG2.

To evolve an existing model, PLG2 replaces an activity with a subprocess
generated using the context-free grammar described earlier. This operation,
which takes place randomly and with a probability provided by the user, is
repeated for each activity of the process. The new process could be very
similar to the originating one, or very different: this basically depends
on the probability configured by the user.
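
The effect of the evolution probability can be illustrated with a deliberately simplified sketch in which a process is just a sequence of activity names. The `evolve` function, the naming of new activities, and the choice of replacing an activity with a short sequence (or with a skip) are assumptions of this example; the actual tool generates the replacement subprocess with its context-free grammar, which also covers gateways and loops.

```python
import itertools
import random


def evolve(activities, probability, rng=random, fresh=None):
    """Replace each activity, with the given probability, by a random fragment."""
    # Generator of names for the newly introduced activities (hypothetical).
    fresh = fresh or (f"New activity {i}" for i in itertools.count(1))
    evolved = []
    for activity in activities:
        if rng.random() < probability:
            size = rng.randint(0, 3)     # 0 means the activity becomes a skip
            evolved.extend(next(fresh) for _ in range(size))
        else:
            evolved.append(activity)
    return evolved


original = ["Activity A", "Activity B", "Activity C", "Activity D"]
print(evolve(original, 0.25))
```

A probability close to zero returns a process almost identical to the original one, while a probability close to one replaces nearly every activity, producing a much stronger concept drift.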

Figure 5: Simulation details: the process model used and the size of the
buffer. The entire simulation lasted for two hours.
(a) Process model used for performance computation of the stream reported
in this section.
(b) Size of the buffer and total number of events sent for the process
reported in Fig. 5a, plotted against the stream running time (hh:mm).


For example, Figure 6 reports two evolutions of a process model which has
been randomly generated and which is reported in (a). In (b) the procedure
applies the evolution by replacing "Activity G" with a sequence of three
activities ("Activity H", "Activity I" and "Activity J"). In (c), the
evolution involves "Activity D" (and the associated data object), which is
replaced with a skip (i.e., it is removed).

Please note that an evolution could involve the creation or the deletion 
of data objects as well. Process evolution, therefore, can be used for the 
definition of particular experiments (e.g., a stream with random concept 
drifts occurring every 1000 events). 

Figure 6: A process model randomly generated with two sequential
evolutions. Please note that new activities can be introduced or removed
(with the associated data objects).
(a) Starting process model, randomly generated.
(b) First evolution of the process model. In this case, activity G has been
replaced by a sequence of three activities (H, I, J).
(c) Second evolution of the process model. In this case activity D, and the
associated data object 'variable_a', have been removed.

7 Implementation Details of PLG2 

PLG2 has been implemented as a Java application. It is available as an open
source project, and binary files are also provided for convenience.⁷ The
project APIs can also be easily used to randomly generate processes or
logs. Listing 3 reports the Java code required to generate a random process
model, to simulate it in order to create 1000 traces, and to export the
process as a Petri net (using the PNML standard).

⁷ See http://plg.processmining.it and https://github.com/delas/plg

Listing 3: Java fragment for the creation of a new process, its simulation
(to generate 1000 traces), and its export as Petri net.

    // process randomization
    Process p = new Process("test");
    ProcessGenerator.randomizeProcess(p,
        RandomizationConfiguration.BASIC_VALUES);

    // log simulation to generate 1000 traces
    XLog l = new LogGenerator(p,
        new SimulationConfiguration(1000)).generateLog();

    // export as pnml
    new PNMLExporter().exportModel(p, "p.pnml");

The current implementation is able to load BPMN files generated with
Signavio or PLG2. It is also possible to export a model as PNML [21] or as
a PLG2 file. Moreover, it is possible to export a graphical representation
of the model (both in terms of BPMN and Petri net) using the Graphviz file
format. The simulation of log files generates XES-compliant⁸ objects,
which can be exported both as (compressed) XES or (compressed) MXML. These
formats are widely used by most process mining tools.

Figure 7 reports a screenshot of the current implementation of PLG2. From
the picture it is possible to see the main structure of the GUI: there is a
"workspace" list of generated processes on the left. The selected process
is shown in the main area. Right-clicking on activities allows the user to
set up activity-specific properties (such as times, or data objects). In
the bottom part of the main application it is possible to see the PLG2
console. Here the application reports all the log information, useful for
debugging purposes. The application dialog in the foreground is used for
the configuration of the Python script which will be used to determine the
time properties. As shown, specific syntax highlighting and other typing
hints (such as automatic indentation) help the user in writing Python code.
The stream dialog is also displayed in the foreground. As can be seen, in
this case, it is possible to dynamically change the streamed process and
the time multiplier. The right-hand side of this dialog (in the rectangle
with black background), moreover, reports "a preview" of the stream: 30
seconds of the stream are reported (each round dot represents, in this
case, up to 3 events).

As stated previously, some components of PLG2 require the execution of
Python scripts. To deal with that we used the Jython framework,⁹ which,
basically, is an implementation of Python which can run in Java. The
interaction between Java and Python objects is encapsulated in the
ScriptExecutor hierarchy, reported in an earlier figure.

⁸ See http://www.xes-standard.org
⁹ See http://www.jython.org

Figure 7: Screenshot of PLG2. From the current visualization it is possible
to see that several models have been created in the workspace, the dialog
for the time rules configuration for "Activity C", and the console, which
shows general information on what is going on. The stream dialog is
reported as well.

Since it is possible to repeat the code fragment reported in Listing 3 as
many times as required, we are able to fulfill C1 (Section 1.1). The
detailed process simulation, the advanced data values generation and the
noise configuration are necessary to create realistic multiperspective
event logs and therefore to accomplish C2. Finally, the feasibility of the
stream procedure reported, together with its main features (such as the
possibility to generate multiperspective streams, the dynamic change of the
originating process model and the possibility to adapt the time between
events emitted), makes it possible to successfully cope with C3.

8 Case Studies 

In this section we propose two possible scenarios in which PLG2 can easily
be applied. In particular, we will show a multiperspective analysis,
performed in an offline setting, and a control-flow discovery activity in
an online scenario.
Figure 8: Process model used for the offline simulation.


8.1 Offline Setting 

In the first case study, we would like to analyze both the control-flow and
the data perspectives of a log file.

To perform our test, we generated a random process model, and then we
slightly modified it manually, in order to fit our goals. Specifically, we
added required data objects to activities C, D, and E. These data objects
are all named variable_a, but each of them has a different value. The
generated model is reported in Figure 8 and represents our gold standard.
Therefore, when we perform the data analysis, we expect the presence of a
variable influencing the control-flow for those activities.

To perform our simulation, we generated a log with 2000 traces and then we
analyzed it using ProM¹⁰ [40]. For the control-flow discovery analysis we
ran the Inductive Miner algorithm. Then, we converted the generated model
into a Petri net. The result is reported in Figure 9a. As we can see, from
the behavioral point of view, the mined model reflects the original one,
except for the data perspective (which cannot be extracted with Inductive
Miner). Starting from the mined Petri net, we ran the Data-flow Discovery
plugin in order to add data variables governing the control-flow. The
result, which is reported in Figure 9b, shows the presence of a variable
named variable_a which is written by activity A, and read by activities C,
D, and E. The screenshot also reports the actual guard for activity E
(i.e., the value that is required in order to execute that activity). Both
the mined control-flow and the mined data flow reflect the expected ones.

As a second test, we mined the control-flow using the tool Disco.¹¹ The
control-flow discovered by the tool is shown in Figure 10. The formalism
adopted by the tool for the representation of business processes allows us
to see, basically, only direct following relationships. As we can note,
activities C, D and E are executed, respectively, 676, 622, and 702 times.
Since in total we have 2000 traces, this is an indication that those
activities may be mutually exclusive (although this is not necessarily the
case). Instead, activities H and I are both executed 2000 times, but we see
there are connections between them. These connections indicate that the
activities are not executed in a specific order (i.e., they are parallel).
These behavioral characteristics reflect the gold standard.

¹⁰ See http://www.promtools.org
¹¹ See http://fluxicon.com/disco/

Figure 9: Results of mining activities (control-flow and data flow)
performed using ProM plugins.
(a) Petri net.
(b) Result of mining the data flow, which decorates the mined Petri net.

8.2 Online Setting 

For the second case study, we decided to analyze the online scenario with
concept drifts. To achieve this goal, we created a second model ($M_2$),
different from the previous one ($M_1$). Then we started streaming events
referring to $M_1$.

In the meanwhile, we configured the stream mining plugin implemented 
in ProM and described in |Q. Specifically, we used the mining approach 
based on Lossy Counting, with parameter $\epsilon = 0.032$. We also
configured the miner to update the graphical representation of the process
model every 500 events received. The sequence of models extracted is
reported in Figure 11.

Figure 10: Result of mining the log using the tool Disco.

The first two models extracted are not equivalent to the expected one,
since the miner needs several observations in order to reinforce and accept
the patterns. Figure 11c shows the model which is equivalent to the gold
standard (in this picture split/join semantics are not reported for
readability purposes). At this point, we changed the stream to emit
events referring to M2 (i.e., we simulated the occurrence of a concept drift).
The first model extracted after this concept drift is reported in Figure 11d
and shows both process models M1 and M2 embedded into the same
representation. This is a known phenomenon, due to the inertia of
stream-based approaches. After some events, since the miner is no longer
receiving observations from M1, it starts to forget its structures. After
some more events, the second model M2 is finally discovered, as shown
in Figure 11g, and no traces of M1 are left.
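
The inertia described above can be reproduced with a minimal Lossy
Counting sketch in Python. This is not the code of the ProM plugin: the
class name, the choice of counting only directly-follows pairs, and the
event counts are assumptions made for illustration. With the error
parameter 0.032 used in this case study, the bucket width is
ceil(1 / 0.032) = 32 events, and an entry is dropped only at a bucket
boundary where its frequency plus its maximum error no longer exceeds
the current bucket number.

```python
import math


class LossyCounter:
    """Approximate frequency counter over a stream (Lossy Counting)."""

    def __init__(self, epsilon):
        self.bucket_width = math.ceil(1 / epsilon)
        self.n = 0          # number of items observed so far
        self.entries = {}   # item -> (frequency, maximum error)

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.bucket_width)
        # A new item may have been missed in all previous buckets.
        frequency, error = self.entries.get(item, (0, bucket - 1))
        self.entries[item] = (frequency + 1, error)
        # At each bucket boundary, drop items that may be infrequent.
        if self.n % self.bucket_width == 0:
            self.entries = {
                key: (f, e)
                for key, (f, e) in self.entries.items()
                if f + e > bucket
            }

    def frequency(self, item):
        return self.entries.get(item, (0, 0))[0]


counter = LossyCounter(epsilon=0.032)
for _ in range(100):
    counter.add(("A", "B"))   # relation observed only before the drift
for _ in range(3099):
    counter.add(("A", "C"))   # relation observed only after the drift
still_remembered = counter.frequency(("A", "B"))
counter.add(("A", "C"))       # event 3200 closes bucket number 100
forgotten = counter.frequency(("A", "B")) == 0
```

In this toy run, a relation seen 100 times before the drift survives
until bucket 100 closes, that is, for 3100 further events. This is why
a model mined shortly after a drift can mix structures of both
processes.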


(a) First model (after 500 events).
(b) Intermediate model.
(c) Correct first model (after 2000 events).
(d) Model mined after the drift occurred.
(e) Intermediate model.
(f) Intermediate model.
(g) Correct second model extracted (about 10000 events after the drift
occurred).

Figure 11: Evolution of discovered models during a process stream
simulation in which a concept drift occurs.


With these two case studies, we tried to show some of the possible
usages of the described approaches. In these tests, we only used algorithms
already available in the literature; the primary goal, however, should be
testing new ones. Moreover, in the described cases, we manually
compared the mined models with the expected ones, but this comparison
could be automated. Finally, since we provide libraries to perform all
functionalities via Java code, batch approaches could be designed in order
to perform the same operations against large repositories with models
expressing very different behaviors.
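
One possible way to automate the comparison, sketched here in Python
under strong assumptions, is to reduce both the gold standard and the
mined model to their sets of directly-follows edges and to compute
precision, recall and F1 over those sets. This measure, the function
name `edge_scores` and the example edges are illustrative choices and
not the comparison used in the case studies; edge-level scores also
ignore split/join semantics.

```python
def edge_scores(reference_edges, mined_edges):
    """Return (precision, recall, f1) of mined edges w.r.t. a reference."""
    reference, mined = set(reference_edges), set(mined_edges)
    common = reference & mined
    precision = len(common) / len(mined) if mined else 0.0
    recall = len(common) / len(reference) if reference else 0.0
    if common:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0  # no shared edge: avoid dividing by zero
    return precision, recall, f1


# Hypothetical gold standard and mined model, as directly-follows edges.
reference = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
mined = {("A", "B"), ("B", "D"), ("A", "D")}

precision, recall, f1 = edge_scores(reference, mined)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A function of this kind could be applied in a loop over a repository of
generated models, pairing each gold standard with the model mined from
its simulated log.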

9 Conclusions and Future Work 

This paper describes PLG2, which is the evolution of an already available
tool. The old tool was able to randomly generate process models and
simulate them. The new tool introduces updates on two sides: on the one
hand, it extends the support to multiperspective models (by adding detailed
control of the time perspective and introducing data objects); on the other
hand, it provides full support for the simulation of online settings
(generating drifting models and simulating event streams).

We believe that the combination of the two newly introduced aspects
makes the tool a valid instrument for the data mining, information
systems, and process mining communities, since it allows the simulation of
very complex scenarios. Given the wide adoption of its predecessor,
we think that the new features of PLG2 are important in
order to push and help researchers to tackle the new challenges that
upcoming settings pose (for example, big data requires handling streams
of multiperspective data).

We think that a lot of work is still necessary in this field: the simulation
of real scenarios is a very tough and broad task. In particular, it is
important to investigate how to generate even more realistic scenarios. To
achieve such realism, it is necessary to work both on the model generation
(control-flow, time and data perspective) and on the simulation (for example
identifying new types of noise). For example, the introduction of noise on
the modeling could be considered (e.g., inserting or removing edges
randomly, or in specific contexts).

An example of possible future work is the ad hoc simulation
of the social perspective (identifying common patterns and possible
behaviors) which, right now, is already possible, but only through the data
perspective (e.g., generating data that describe the originators). Another
direction for future work, on the simulation side, consists in introducing
noise that refers not to the trace/event modification, but to the
distribution of the cases (i.e., not all control-flow paths are equally
probable).